Skip to content

feat: support lp_norm operator on metax#1359

Open
LindseyMei wants to merge 1 commit into
InfiniTensor:mainfrom
LindseyMei:feat/metax-lp-norm
Open

feat: support lp_norm operator on metax#1359
LindseyMei wants to merge 1 commit into
InfiniTensor:mainfrom
LindseyMei:feat/metax-lp-norm

Conversation

@LindseyMei

Copy link
Copy Markdown

Add MetaX backend for lp_norm.

Changes

  • Create src/infiniop/ops/lp_norm/metax/lp_norm_metax.h and src/infiniop/ops/lp_norm/metax/lp_norm_metax.maca, reusing ../cuda/kernel.cuh.
  • Wire MetaX dispatch into operator.cc (CREATE / GET workspace size / CALCULATE / DESTROY).
  • Make lp_norm/cuda/kernel.cuh compile on MACA by excluding MetaX from the CUDART_VERSION >= 12090 ::cuda::maximum() branch; MetaX uses the cub::Max() / lambda fallback.

Validation

  • python3 test/infiniop/lp_norm.py --metax passes on MetaX C500 for F16/BF16/F32, including contiguous and strided axis=-1 cases.
  • clang-format --dry-run --Werror clean on all 4 modified/new files.
  • Build: xmake -y -j4 + xmake install -y succeeded.

Notes

  • This PR is based on the current main branch.
  • The CUDART guard change is a minimal portability fix; the scalar cub::Max() path is already used by other MetaX operators (e.g. reduce/cuda/reduce.cuh).

Add MetaX backend for lp_norm:
- Create metax/lp_norm_metax.h and metax/lp_norm_metax.maca
- Wire MetaX dispatch into operator.cc (CREATE/GET/CALCULATE/DELETE)
- Make lp_norm/cuda/kernel.cuh compile on MACA by excluding MetaX
  from the CUDART_VERSION >= 12090 ::cuda::maximum() branch

Validation: python3 test/infiniop/lp_norm.py --metax passes on MetaX C500
(F16/BF16/F32, contiguous and strided axis=-1 cases).

Signed-off-by: LindseyMei <648816901@qq.com>
@LindseyMei LindseyMei requested a review from a team July 1, 2026 07:55
@LindseyMei

Copy link
Copy Markdown
Author

MetaX lp_norm Performance Report

Environment

  • GPU: MetaX C500 64GB
  • MACA: 3.3.0.15
  • Repository: InfiniTensor/InfiniCore
  • Branch: feat/metax-lp-norm
  • Commit: 27f2ef05
  • Benchmark script: /data/workspace/lp_norm_bench.py
  • Method: explicit torch.cuda.synchronize(), 10 warmup + 300 iterations

What changed

Added a MetaX backend for the lp_norm operator. The implementation reuses the existing lp_norm/cuda/kernel.cuh with a minimal portability fix for MACA (exclude MetaX from the CUDART_VERSION >= 12090 ::cuda::maximum() branch so it falls back to cub::Max() / lambda reduction).

Correctness

python3 test/infiniop/lp_norm.py --metax passes on MetaX C500 for F16/BF16/F32, covering:

  • contiguous tensors with axis = 0 / 1 / 2 / 3
  • strided tensors with axis = -1
  • inplace / out-of-place

Performance (libinfiniop)

shape axis p dtype lib_ms Gelem/s GB/s
(4096, 4096) -1 2 F16 0.587 28.57 114.3
(4096, 4096) -1 2 F32 0.594 28.23 225.9
(4096, 4096) -1 2 BF16 0.600 27.96 111.8
(4096, 4096) -1 1 F16 0.596 28.13 112.5
(4096, 4096) -1 1 F32 0.595 28.21 225.7
(4096, 4096) -1 1 BF16 0.601 27.93 111.7
(8192, 8192) -1 2 F16 1.620 41.43 165.7
(8192, 8192) -1 2 F32 1.653 40.59 324.7
(8192, 8192) -1 2 BF16 1.695 39.60 158.4
(8192, 8192) -1 1 F16 1.638 40.97 163.9
(8192, 8192) -1 1 F32 1.665 40.32 322.5
(8192, 8192) -1 1 BF16 1.694 39.61 158.4
(16384, 16384) -1 2 F16 5.570 48.19 192.8
(16384, 16384) -1 2 F32 5.714 46.98 375.8
(16384, 16384) -1 2 BF16 5.847 45.91 183.6
(16384, 16384) -1 1 F16 5.608 47.87 191.5
(16384, 16384) -1 1 F32 5.694 47.14 377.1
(16384, 16384) -1 1 BF16 5.789 46.36 185.5
(12, 16, 512, 512) 3 2 F16 1.181 42.62 170.5
(12, 16, 512, 512) 3 2 F32 1.224 41.13 329.0
(12, 16, 512, 512) 3 2 BF16 1.236 40.71 162.9
(12, 16, 512, 512) 3 1 F16 1.178 42.74 171.0
(12, 16, 512, 512) 3 1 F32 1.235 40.76 326.1
(12, 16, 512, 512) 3 1 BF16 1.233 40.82 163.3

Observations

  • lp_norm is a reduction-style normalization (max reduction + p-norm reduction + elementwise scale), so its effective bandwidth is lower than pure elementwise kernels, as expected.
  • F32 reaches ~375 GB/s on the largest 2D tensor, indicating the kernel is memory-bandwidth-bound for large reductions.
  • F16/BF16 bandwidth is roughly half of F32 because the kernel accumulates in float and does not yet use half-precision vector loads; this is consistent with the current CUDA implementation.
  • Performance is stable across p=1 and p=2.

Known limitations

  • The shared kernel accumulates intermediate results in float regardless of input dtype, which limits F16/BF16 bandwidth.
  • No warp-level vectorization is applied yet; future work could explore half2 / float4 loads for the elementwise scale loop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant